Corpora and Data Preparation for Information Extraction
Abstract
The data selection and preparation that produced the TIPSTER and Fifth Message Understanding Conference (MUC-5) corpora required substantial effort, time, and resources. The Government commitment to this work stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extraction technology, (2) to provide accurate test data to evaluate and baseline system performance in an objective manner, (3) to provide baseline data for human performance to understand and interpret machine performance, and (4) to support the larger Natural Language Processing community by making available, under ARPA support, a unique set of texts and templates in multiple domains and languages. This commitment was demonstrated through the managerial, technical, and administrative support that various Government agencies gave to these efforts, as well as through contractual efforts with the Institute for Defense Analyses for data preparation and with New Mexico State University for software tool development.
Related resources
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used to train statistical machine translation methods are usually drawn from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Corpora and data preparation
The data selection and data preparation efforts which led to the TIPSTER and Fifth Message Understanding Conference (MUC-5) evaluation corpora involved substantial effort, time and resources. The Government commitment to these selection and preparation efforts stems from four TIPSTER Program objectives: (1) to provide training data that would promote the development of information extractio...
Tasks, Domains, and Languages for Information Extraction
The information extraction tasks for the ARPA TIPSTER program center on automatically filling object-oriented data structures, called templates, with information extracted from free text in news stories (for discussion of templates and objects, see "Template Design for Information Extraction" in this volume). With text as input, the TIPSTER systems first detect whether the text contains relevan...
Tasks, domains, and languages
The Fifth Message Understanding Conference (MUC-5) involved the same tasks, domains and languages as the information extraction portion of the ARPA TIPSTER program. These tasks center on automatically filling object-oriented data structures, called templates, with information extracted from free text in news stories (for discussion of templates and objects, see "Template Design for Informati...
Automatic extraction of bilingual word pairs using inductive chain learning in various languages
In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages. Therefore, automatic extraction of bilingual word pairs from parallel corpora with various languages is important. However, previous works based on statistical methods are insufficien...